Pushing the limits for medical image reconstruction on 
recent standard multicore processors 
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ABSTRACT 

Volume reconstruction by backprojection is the computa- 
tional bottleneck in many interventional clinical computed 
tomography (CT) applications. Today vendors in this field 
replace special purpose hardware accelerators by standard 
hardware like multicore chips and GPGPUs. Medical imag- 
ing algorithms are on the verge of employing High Per- 
formance Computing (HPC) technology, and are therefore 
an interesting new candidate for optimization. This pa- 
per presents low-level optimizations for the backprojection 
algorithm, guided by a thorough performance analysis on 
four generations of Intel multicore processors (Harpertown, 
Westmere, Westmere EX, and Sandy Bridge). 

We choose the RabbitCT benchmark, a standardized test- 
case well supported in industry, to ensure transparent and 
comparable results. Our aim is to provide not only the 
fastest possible implementation but also compare to perfor- 
mance models and hardware counter data in order to fully 
understand the results. We separate the influence of algo- 
rithmic optimizations, parallelization, SIMD vectorization, 
and microarchitectural issues and pinpoint problems with 
current SIMD instruction set extensions on standard CPUs 
(SSE, AVX). The use of assembly language is mandatory 
for best performance. Finally we compare our results to the 
best GPGPU implementations available for this open com- 
petition benchmark. 



Gerhard Wellein 

Erlangen Regional Computing 
Center 
Martensstr. 1 
91058 Erlangen, Germany 
gerhard.wellein@rrze.uni- 
erlangen.de 

1. INTRODUCTION AND RELATED WORK 
1.1 Computed tomography 

Computed tomography (CT) [I] is nowadays an established 
technology to non-invasively determine a three-dimensional 
(3D) structure from a series of 2D projections of an ob- 
ject. Beyond its classic application area of static analysis in 
clinical environments the use of CT has accelerated substan- 
tially in recent years, e.g., towards material science or time- 
resolved scans supporting interventional cardiology. The nu- 
merical volume reconstruction scheme is a key component 
of modern CT systems and is known to be very compute- 
intensive. Acceleration through special-purpose hardware 
such as FPGAs is a typical approach to meet the constraints 
of real-time processing. Integrating nonstandard hardware 
into commercial CT systems adds considerable costs both in 
terms of hardware and software development, as well as sys- 
tem complexity. From an economic view the use of standard 
x86 processors would thus be preferable. Driven by Moore's 
law the compute capabilities of standard CPUs have now 
the potential to meet the requested CT time constraints. 




Figure 1: C-arm system illustration (Axiom Artis Zeego, 
Siemens Healthcare, Forchheim, Germany). 
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Figure 2: Volume renderings based on the reconstruction of 
2D X-ray projections of a rabbit. 



The volume reconstruction step for recent C-arm systems 
with flat panel detector can be considered as a prototype 
for modern clinical CT systems. Interventional C-arm CTs, 
such as the one sketched in Fig. [T] perform the rotational 
acquisition of 496 high resolution X-ray projection images 
(1248x960 pixels) in 20 seconds [2]. This acquisition phase 
sets a constraint for the maximum reconstruction time to 
attain real-time reconstruction. In practice filtered backpro- 
jection (FBP) methods such as the Feldkamp algorithm [3] 
are widely used for performance reasons. The algorithm 
consists of 2D pre-processing steps, backprojection, and 3D 
post-processing. Volume data is reconstructed in the back- 
projection step, making it by far the most time consuming 
part [3]. It is characterized by high computational inten- 
sity, nontrivial data dependencies, and complex numerical 
evaluations but also offers an inherent embarrassingly par- 
allel structure. In recent years hardware-specific optimiza- 
tion of the Feldkamp algorithm has focused on CPUs |5] [6] 
and IBM Cell processors [7j. For CPUs in particular, large 
performance gains compared to CPUs were reported [6] or 
documented by the standardized RabbitCT benchmark [S] 
[9]. Available studies with standard CPUs indicate that large 
servers are required to meet GPU performance [10]. In this 
report we also use the RabbitCT environment, which de- 
fines a clinically relevant test case and is supported by in- 
dustry. RabbitCT is an open competition benchmark based 
on C-arm CT images of a rabbit (see Fig. [2|. It allows to 
compare the manifold of existing hardware technologies and 
implementation alternatives for reconstruction scenarios by 
applying them to a fixed, well-defined problem. 

At research level, recent reconstruction methods use more 
advanced iterative techniques, which can provide supe- 
rior image quality in special cases like sparse or irregular 
data [11]. Other algorithms are used to reconstruct time- 
resolved volumes ("3D+f) [12], e.g., for cardiac imaging. 
However, both approaches incorporate several backpro- 
jection steps, making performance improvements on the 
RabbitCT benchmark valuable for them as well. The 
same holds for industrial CTs in material science used in 
nondestructive testing (NDT), which additionally run at 
higher resolution and thus further increase the computa- 
tional requirements [13]. The high computational demand 
of backprojection algorithms together with the growing ac- 
ceptance of parallel computing in medical applications make 
them interesting new candidates on the interface to High 



Performance Computing. 

1.2 Modern processors 

The steadily growing transistor budget is still pushing for- 
ward the compute capabilities of commodity processors at 
rather constant price and clock speed. Improvement and 
scaling of established architectural features in the core (like, 
e.g., SIMD and SMT; see below for details) in addition to 
increasing the number of cores per chip lead to peak perfor- 
mance levels which enable standard CPUs to meet the time 
constraints of interventional CT systems. Optimized single 
core implementations are thus mandatory for reconstruction 
algorithms. Complemented by standard optimization tech- 
niques, a highly efficient SIMD (Single Instruction Multiple 
Data) code is the key performance component. Owing to 
the nontrivial data parallelism in the reconstruction scheme, 
SIMD optimizations need to be done at low level, i.e., down 
to assembly language. However, these efforts will pay off 
for future computer architectures since an efficient SIMD 
implementation will become of major importance to further 
benefit from increasing peak performance numbers. 

Scaling SIMD width is a safe bet to increase raw core per- 
formance and optimize energy efficiency (i.e., maximize the 
flop/ Watt ratio), which is known to be the critical compo- 
nent in future HPC systems. Recently Intel has played the 
game by doubling the SIMD width from the SSE instruc- 
tion set (128-bit registers) to AVX [14] (256-bit registers, 
with larger widths planned for future designs), which is im- 
plemented in the "Sandy Bridge" architecture. More ongo- 
ing projects are pointing in the same direction, like Intel's 
Many Integrated Core (MIC) [15] architecture or the Chi- 
nese Godson-3B chip [TS] . Wider SIMD units do not change 
core complexity substantially since the optimal instruction 
throughput does not depend on the SIMD width. However, 
the benefit in the application strongly depends on the level 
and the structure of data parallelism as well as the capability 
of the compiler to detect and exploit it. Compilers are very 
effective for simple code patterns like streaming kernels [17] , 
while more complex loop structures require manual inter- 
vention, at least to the level of compiler intrinsics. This is 
the only safe way to utilize SIMD capabilities to their full 
potential. The impact of wider SIMD units on programming 
style is still unclear since this trend currently starts to ac- 
celerate. Of course wide SIMD execution is most efficient 
for in-cache codes because of the large "DRAM gap." 

Simultaneous Multi- Threading (SMT) is another technol- 
ogy to improve core utilization at low architectural impact 
and energy costs. It is obvious that SMT should be most 
beneficial for those "in cache codes" that have limited sin- 
gle thread efficiency due to bubbles in arithmetic/logic 
pipelines, and where cache bandwidth is not the dominat- 
ing factor. Naively one would not expect improvements 
from SMT if a code is fully SIMD-vectorized since SIMD is 
typically applied to simple data-parallel structures, which 
are a paradigm for efficient pipeline use. Since many pro- 
grammers do not care about pipeline occupancy in their 
application the benefit of SMT is often tested in an experi- 
mental way, without arriving at a clear explanation for why 
it does or does not improve performance. 

This paper is organized as follows. In Sect. [3] we perform a 



first analysis of the backprojection algorithm implemented 
in the RabbitCT framework using simple metrics like arith- 
metic throughput and memory bandwidth as guidelines for 
estimating performance. We address processors from four 
generations of Intel's x86 family (Harpertown, Westmere, 
Westmere EX, and Sandy Bridge). Basic optimization rules 
such as minimizing overall work are applied. In Sect. [4] we 
show how to efficiently vectorize the inner loop kernel using 
SSE and AVX instructions, and discuss the possible benefit 
of multithreading with SMT. Sect. [5] provides an in-depth 
performance analysis, which will show that simple band- 
width or arithmetic throughput models are inadequate to 
estimate the performance of the algorithm. OpenMP paral- 
lelization and related optimizations like ccNUMA placement 
and bandwidth reductions are discussed in Sect. [6] Perfor- 
mance results for cores, sockets, and nodes on all four plat- 
forms are given in Sect. where we also interpret the effect 
of the different optimizations discussed earlier and validate 
our performance model. Finally we compare our results with 
current GPGPU implementations in Sect. [8] 

2. EXPERIMENTAL TESTBED 

A selection of modern Intel x86-based multicore processors 
(see Table [TJ has been chosen to test the performance po- 
tential of our optimizations. All of these chips feature a 
large outer level cache, which is shared by two (Core 2 Quad 
"Harpertown"), four (Sandy Bridge), six (Westmere EP), or 
ten cores (Westmere EX). We refer to the maximum num- 
ber of cores sharing an outer level L2/L3 cache as an "L2/L3 
group." 

With the initiation of the Core i7 architecture the mem- 
ory subsystem of Intel processors was redesigned to allow 
for a substantial increase in memory bandwidth, at the 
price of introducing ccNUMA on multisocket servers. At 
the same time Intel also relaunched simultaneous multi- 
threading (SMT, a.k.a. "Hyper-Threading") with two SMT 
threads per physical core. The most recent processor, Sandy 
Bridge, is equipped with a new instruction scheduler, sup- 
ports the new AVX SIMD instruction set extension, and 
has a new last level cache subsystem (which was already 
present in Nehalem EX). The 10-core Intel Westmere EX is 
not mainly targeted at HPC clusters but at large mission- 
critical servers. It reflects the performance maximum for x86 
shared-memory nodes. A comprehensive summary of the 
most important processor features is presented in Table [1] 
Note that the Sandy Bridge model used here is a desktop 
variant, while the other processors are of the server ("Xeon") 
type. Table [1] also contains bandwidth measurements for a 
simple update benchmark: 

for(int i=0; i<N; ++i) 
a[i] = s * a [ i ] ; 

This benchmark reflects the data streaming properties of 
the reconstruction algorithm and is thus better suited than 
STREAM [17] as a baseline for a quantitative performance 
model. 

We use the Intel C/CH — h compiler in version 12.0; since 
most of our performance-critical code is written in assembly 
language, this choice is marginal, however. Thread affin- 
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Figure 3: Setup geometry for generating the CT projection 
images. The size of the volume is always 256 3 mm 3 , but the 
number of voxels may vary. 

ity, hardware performance monitoring, and low-level bench- 
marking was implemented via the LIKWID tool suite [IS I 
I19| . using the tools likwid-pin, likwid-perfctr, and likwid- 
bench, respectively. 

3. THE ALGORITHM 
3.1 Code analysis 

The backprojection algorithm (as provided by the RabbitCT 
framework [5] , see Listing[T]) is usually implemented in single 
precision (SP) and exhibits a streaming access pattern for 
most of its data traffic. One volume reconstruction uses 
496 CT images (denoted by I) of 1248x960 pixels each 
(ISXxISY). The volume size is 256 3 mm 3 . MM is the voxel 
size and changes depending on the number of voxels. The 
most common resolution in present clinical applications is 
512 voxels in each direction (denoted by the problem size L). 
Each CT image is accompanied by a 3 x 4 projection matrix 
A, which projects an arbitrary point in 3D space onto the CT 
image. The algorithm computes the contributions to each 
voxel across all projection images. The reconstructed vol- 
ume is stored in array VOL. Voxel coordinates (indices) are 
denoted by x, y, and z, while pixel coordinates are called u 
and v. See Fig. [3] for the geometric setup. 

The aggregated size of all projection images is ~2.4 GB. The 
total data transfer volume of one voxel sweep comprises the 
loads from the projection image and an update operation 
(V0L[i]+=s, see line 14 II in Listing[T]) to the voxel array. The 
latter incurs 8 bytes of traffic per voxel and results (for prob- 
lem size 512 3 ) in a data volume of 1GB, or 496 GB for all 
projections. The traffic caused by the projection images 
is not easy to quantify since it is not a simple stream; it 
is defined by a "beam" of locations slowly moving over the 
projection pixels as the voxel update loop nest progresses. It 
exhibits some temporal locality since neighboring voxels are 
projected on proximate pixels of the image, but there may 
also be multiple streams with large strides. On the compu- 
tational side, the basic version of this algorithm performs 13 
additions, 5 subtractions, 17 multiplications, and 3 divides. 



Table 1: Test machine specifications. The cacheline size is 64 bytes for all processors and cache levels. The update benchmark 
results were obtained with the likwid-bench tool. 



Microarchitecture 


Intel Harpertown 


Intel Westmere 


T J_ 1 TIT j_ 1 '. A- 

Intel Westmere EX 


Intel Sandy Bridge 


Model 


Xeon X5482 


Xeon X5670 


Xeon E7- 4870 


Core i7-2600 


Label 


HPT 


WEM 


WEX 


SNB 


Clock [GHz] 


3.2 


2.66 (2.93 turbo) 


2.40 


3.4 (3.5 turbo) 


Node sockets/cores/threads 


2/8/- 


2/12/24 


4/40/80 


1/4/8 


Socket L1/L2/L3 cache 


4x32k/2x6M/- 


6x32k/6x256k/12M 


8x32k/8x256k/30M 


4x32k/4x256k/8M 


Bandwidths [GB/s]: 


Theoretical socket BW 


12.8 


32.0 


34.2 


21.3 


Update (1 thread) 


5.9 


15.2 


8.3 


16.5 


Update (socket) 


6.2 


20.3 


24.6 


17.3 


Update (node) 


8.4 


39.1 


98.7 





Listing 1: Voxel update loop nest for the plain backprojec- 
tion algorithm. This gets executed for each projection I. All 
variables are of type float unless indicated otherwise. The 
division into parts (see text) is only approximate since there 
is no 1:1 correspondence to the SIMD-vectorized code. 



wz = offset_2 
f or ( int z =0 ; 
wy = offset. 



z<L ; 

y ; 



z++, wz+=MM) { 



for (int y=0; y<L; y++, 
wx = offset_x; 
valtl=0.0f; valtr=0.0f 
valbl=0.0f; valbr=0.0f 



wy+=MM) { 



// Part 1 

for (int x=0; x<L; x++ , wx+=MM) { 

uw = (A [0] *wx + A [3] *wy + A [6] *wz+A [9] ) ; 
vw = (A [1] *wx + A [4] *wy + A [7] *wz + A [10] ) : 
w = (A [2] * wx + A [5] *wy + A [8] *wz+A [11] ) : 



u = uw 



* 1 . Of /w ; 



v = vw 



* l.Of/w; 



3.2 Simple performance models 

Based on this knowledge about data transfers and arithmetic 
operations we can derive a first rough estimate of expected 
upper performance bounds. The arithmetic limitation re- 
sults in 21 cycles per vectorized update (4 and 8 inner loop 
iterations for SSE and AVX, respectively), assuming full vec- 
torization and a throughput of one divide per cycle. This 
takes into account that all architectures under consideration 
can execute one addition and one multiplication per cycle, 
neglects the slight imbalance of additions versus multiplica- 
tions, and assumes that the pipelined repps instruction can 
be employed for the divisions (see Sect. I4.ll for details). 

On the other hand, runtime can also be estimated based on 
the data transfer volume and the maximum data transfer 
capabilities of the nodes measured with the synthetic up- 
date benchmark described in Sect. [21 The following table 
shows upper performance bounds for a full 512 3 reconstruc- 
tion based on arithmetic and bandwidth limitations on the 
four systems in the testbed (full nodes): 



int iu = (int)u, iv = ( int ) v ; 

scalx = u - (float) iu; 
scaly = v - (float) iv; 

// Part 2 

if (iv>=0 kk iv<ISY) { 
if (iu>=0 kk iu<ISX) 

valtl = I[iv*ISX + iu] ; 
if (iu>=-l kk iu<ISX-l) 

valtr = I[iv*ISX + iu+1]; 

} 

if (iv>=-l kk iv<ISY-l) { 
if (iu>=0 kk iu<ISX) 

valbl = I[(iv+l)*ISX + iu] ; 
if (iu>=-l kk iu<ISX-l) 

valbr = I[(iv+l)*ISX + iu+1]; 



HPT WEM WEX SNB 



} 

// Part 3 

vail = scaly*valbl + ( 1 . Of - scaly )* valtl ; 
valr = scaly*valbr + ( 1 . Of - scaly )* valtr ; 
fx = scalx*valr + ( 1 . Of - s calx )* vail ; 



V0L[z*L*L 
} // x 
} // y 
} // z 



+ y*L + x] += 1.0f/(w*w)*fx; 



Arithm. lim. [GUP/s] 4.86 6.75 13.9 5.31 
BW lim. [GUP/s] 1.06 4.90 11.1 2.15 

Performance is given in billions of voxel updates per second 
(GUP/s) Q where one "update" represents the reconstruction 
step of one voxel using a single image. The low arithmetic 
limitation for the single socket Sandy Bridge is caused by 
its wide AVX vector size and its faster clock. While above 
predictions seem to indicate a strongly memory-bound situ- 
ation, they are far from accurate: The runtime is governed 
by the number of instructions and the cycles it takes to ex- 
ecute these instructions. A reduction to the purely "useful" 
work, i.e., to arithmetic operations, can not be made since 
this algorithm is nontrivial to vectorize due to the scattered 
load of the projection image; it therefore involves many more 
non-arithmetic instructions (see Sect. |4~T1 for details). We 
will show later that a more careful analysis leads to a com- 



x We use SI prefixes, i.e., 1 GUP/s means 10 9 updates per 
second. This is inconsistent with a large part of the litera- 
ture on medical image reconstruction, where "G" is used as 
a binary prefix for 2 30 « 1.074 • 10 9 [5D] 



pletely different picture, and that further optimizations can 
change the bottleneck analysis considerably. 

In order to have a better view on low-level optimizations we 
divide the algorithm into three parts: 

1. Geometry computation: Calculate the index of the 
projection of a voxel in pixel coordinates 

2. Load four corner pixel values from the projection im- 
age 

3. Interpolate linearly for the update of the voxel data 

3.3 Algorithmic optimizations 

The first optimizations for a given algorithm must be on a 
hardware-independent level. Beyond elementary measures 
like moving invariant computations out of the inner loop 
body and reducing the divides to one reciprocal (thereby 
reducing the flop count to 31), a main optimization is to 
minimize the workload. Voxels located at the corners and 
edges of the volume are not visible on every projection, and 
can thus be "clipped off" and skipped in the inner loop. This 
is not a new idea, but the approach presented here improves 
the work reduction from 24% [21] to nearly 39%. 

The basic building block for all further steps is the update 
of a consecutive line of voxels in x direction, covered by 
the inner loop level in Listing [T] We refer to this as the 
"line update kernel." The geometry, i.e., the position of 
the first and the last relevant voxel for each projection im- 
age and line of voxels is precomputed. This information 
is specific for a given geometric setup, so it can be stored 
and used later during the backprojection loop. Reading the 
data from memory incurs an additional transfer volume of 
512 2 x496x4 bytes=496 MB/s (assuming 16-bit indexing), 
which is negligible compared to the other traffic. The ad- 
vantage of line- wise clipping is that the shape of the clipped 
voxel volume is much more accurately tracked than with the 
blocking approach described in [21] . 

The conditionals (lines [23l and [30l in Listing [1]), which en- 
sure correct access to the projection image, involve no mea- 
surable overhead for the scalar case due to the hardware 
branch prediction. However, for vectorized code they are 
potentially costly since an appropriate mask must be con- 
structed whenever there is the possibility that a SIMD vec- 
tor instruction accesses data outside the projection [5TJ. To 
remove this complication, separate buffers are used to hold 
suitably zero-padded copies of the projection images, so that 
there is no need for vector masks. The additional overhead 
is far outweighed by the performance advantage for vector- 
ized code execution. The conditionals are also effectively 
removed by the clipping optimization described above, but 
we need a code version without clipping for validating our 
performance model later. 

Note that a similar effect could be achieved by peeling off 
scalar loop iterations to make the length of the inner loop 
body a multiple of the SIMD vector size and ensure aligned 
memory access. However, this may introduce a significant 
scalar component especially for small problem sizes and large 
vector lengths. 



4. SINGLE CORE OPTIMIZATIONS 

For all further optimizations we choose an implementation of 
the line update kernel in C as the baseline. All algorithmic 
optimizations from Sect. 13.31 have already been applied. 

The performance of present processors on the core level re- 
lies on instruction-level parallelism (ILP) by pipelined and 
superscalar execution, and data-parallel operations (SIMD). 
We also regard simultaneous multithreading (SMT) as a sin- 
gle core optimization since it is a hardware feature to in- 
crease the efficiency of the execution units by filling pipeline 
bubbles with useful work: The idea is to duplicate parts 
of the hardware resources (control logic, registers) in order 
to allow a quasi-simultaneous execution of different threads 
while sharing other parts like, e.g., floating-point pipelines. 
In a sense this is an alternative to outer loop unrolling, which 
can also provide multiple independent instruction streams. 

We elaborate on the SIMD vectorization here and comment 
on the benefit of SMT in Sect. El 

4.1 SIMD vectorization 

No current compiler is able to efficiently vectorize the back- 
projection algorithm, so we have implemented the code di- 
rectly in x86 assembly language. Using SIMD intrinsics 
could ease the vectorization but adds some uncertainties 
with regard to register scheduling and hence does not allow 
full control over the instruction code. All data is aligned 
to enable packed and aligned loads/stores of vector registers 
(16 (SSE) or 32 (AVX) bytes with one instruction). 

The line update kernel operates on consecutive voxels. Part 
1 of the algorithm (see Sect. 13. 2j) is straightforward to vec- 
torize, since it is arithmetically limited and fully benefits 
from the increased register width. The division is replaced 
by a reciprocal. SSE provides the fully pipelined repps in- 
struction for an approximate reciprocal with reduced accu- 
racy compared to a full divide. This approximation is suffi- 
cient for this algorithm, and results in an accuracy similar to 
GPGPU implementations. An analysis of the impact of the 
approximate reciprocal on performance and accuracy is pre- 
sented in Sect. 17.21 The integer cast (line ll8[) is implemented 
via the vectorized hardware rounding instruction roundps, 
which was introduced with SSE4. 

Part 2 of the algorithm cannot be directly vectorized. Since 
the pixel coordinates from step 1 are already in a vec- 
tor register, the index calculation for, e.g., iv*ISX+iu and 
(iv+l)*ISX+iu (lines [25l [271 E3 and [34] in Listing QJ is 
done using the SIMD floating point units. There are pairs 
of values which can be loaded in one step because they are 
consecutive in memory: valtl/valtr, and valbl/valbr, 
respectively. Fig. 0] shows the steps involved to vectorize 
part 2 and the first linear interpolation. The conversion of 
the index into a general purpose register, which is needed for 
addressing the load of the data and the scattered pairwise 
loads, is costly in terms of necessary instructions. Moreover 
the runtime increases linearly with the width of the register, 
and the whole operation is limited by instruction through- 
put. There are different implementation options with the 
instructions available (see below). Finally the bilinear in- 
terpolation in part 3 is again straightforward to vectorize 
and fully benefits from wider SIMD registers. 
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4. Recover initial ordering: duplicate and blend 
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Figure 4: Vectorization of part 2 of the algorithm: The data 
is loaded pairwise into the vector registers. The interpola- 
tion of iterations 0,2 and 1,3 are computed simultaneously. 
Afterwards the results must be reordered for the second in- 
terpolation step. 



We consider two SSE implementations, which only differ in 
part 2 of the algorithm. Version 1 (VI) converts the floating 
point values in the vector registers to four quadwords and 
stores the result back to memory (cache, actually). Single 
index values are then loaded to general purpose registers 
one by one. Version 2 (V2) does not store to memory but 
instead shifts all values in turn to the lowest position in 
the SSE register, from where they are moved directly to a 
general purpose register using the cvtss2si instruction. 

Note that any further inner loop unrolling beyond what is 
required by SIMD vectorization would not show any benefit 
due to register shortage; however, as will be shown later, 
SMT can be used to achieve a similar effect. 

4.2 AVX implementation 

In theory, the new AVX instruction set extension doubles 
the performance per core. The backprojection cannot fully 
benefit from this advantage because the number of required 
instructions increases linearly with the register width in part 
2 of the algorithm. For arbitrary SIMD vector lengths a 
hardware gather operation would be required to prevent this 
part from becoming a severe bottleneck. 

Also the limited number of AVX instructions that natively 
operate on 256-bit registers impedes more sophisticated vari- 
ants of part 2; only the simple version VI could be ported. 
Implementation of V2 would be possible only at the price of 
a much larger instruction count, so this alternative was not 
considered. Still an improvement of 25% could be achieved 
with the AVX kernel on Sandy Bridge (see Sect. [7] for de- 
tailed performance results). 

Intel already announced the successor AVX2, which will be 
implemented in the Intel Haswell processor in 2012. AVX2 
will fix issues preventing better performance for the backpro- 
jection algorithm which are the hardware gather instruction 
and a more complete instruction set operating on the full 
SIMD register width. 



5. IN-DEPTH PERFORMANCE ANALYSIS 
5.1 Analytic performance model 

Popular performance models for bandwidth-limited algo- 
rithms reduce the influence on the runtime to the algo- 
rithmic balance, taking into account the sustained main 
memory performance [22]. This assumption works well 
in many cases, especially when there is a large gap be- 
tween arithmetic capabilities and main memory bandwidth. 
Recently it was shown [23] that this simplification is prob- 
lematic on newer architectures like Intel Core i7 because 
of their exceptionally large memory bandwidth per socket; 
e.g., a single Nehalem core cannot saturate the memory 
interface [24]. Moreover the additional L3 cache decouples 
core instruction execution from main memory transfers and 
generally provides a better opportunity to overlap data 
traffic between different levels of the memory hierarchy. 
Note that in a multithreaded scenario the simple bandwidth 
model can indeed work as long as multiple cores are able 
to saturate the socket bandwidth. This property depends 
crucially on the algorithm, of course. 

In [23] we have introduced a more detailed analytic method 
for cases where the balance model is not sufficient to predict 
performance with the required accuracy, and the in-cache 
data transfers account for a significant fraction of overall 
runtime. This model is based on an instruction analysis 
of the innermost loop body and runtime contributions of 
cacheline transfer volumes through the whole memory hi- 
erarchy. We first cover single-threaded execution only and 
then generalize to multithreading on the socket once the ba- 
sic performance limitations are understood. 

A useful starting point for all further analysis is the single- 
thread runtime spent executing instructions with data 
loaded from LI cache. The Intel Architecture Code An- 
alyzer (IACA) [25] was used to analytically determine the 
runtime of the loop body. This tool calculates the raw 
throughput according to the architectural properties of the 
processor under the assumption that all data resides in the 
LI cache. It supports Westmere and Sandy Bridge (includ- 
ing AVX) as target architectures. The results for Westmere 
are shown in the following table for the two SSE kernel 
variants described above (all entries except fiOPs are in 
core cycles): 

Issue port 



Variant 
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TP 


^tOPs 


CP 


VI 


15 


21 


24 


3 


3 


19 


24 


85 


54 


V2 


20 


27 


16 


1 


1 


20 


27 


85 


71 



Execution times are calculated separately for all six issue 
ports (0. ..5). (A /iOP is a RISC-like "micro-instruction;" 
x86 processors perform an on-the-fly translation of machine 
instructions to /iOPs, which are the "real" instructions that 
get executed by the core.) Apart from the raw throughput 
(TP) and the total number of fj,OPs the tool also reports 
a runtime prediction taking into account latencies on the 
critical path (CP). Based on this prediction VI should be 
faster than V2 on Westmere. However, the measurements in 
Table [2] show the opposite result. The high pressure on the 
load issue port (2) together with an overall high pressure on 
all ALU issue ports (0, 1, and 5) seems to be decisive. In V2 
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(a) Harpertown (b) Westmere (c) Sandy Bridge (SSE) (d) Westmere EX 

Figure 5: Performance Analysis: Runtime contributions from instruction execution and necessary cacheline transfers. Each 
arrow is a 64-byte cacheline transfer. The total data volume in bytes is indicated on the left of each group of arrows. On 
the right we show the data transfer capabilities between hierarchy levels and the resulting transfer time in core cycles. This 
assumes that the transfer time is solely determined by bandwidth capabilities, and any latency influence is ignored. For data 
transfers from main memory the contribution in memory/FSB cycles are translated to core cycles. In-core execution times 
are measured values from Table [2] scaled to a complete cacheline. 



HPT WEM WEX SNB 
VI SSE 62l 6L6 59^6 44.4 
V2 SSE 57.4 51.5 54.7 50.0 
VI AVX 76.2 
Table 2: Measured execution times (one core) in cycles for 
one iteration of the SIMD-vectorized kernel (i.e., 4 or 8 voxel 
updates) with all operands residing in LI cache. 

the pressure on port 2 is much lower, although the overall 
pressure on all issue ports is slightly larger. 

Below we report the results for the Sandy Bridge architec- 
ture with SSE and AVX. The pressure on the ALU ports is 
similar, but due to the doubled SSE load performance Sandy 
Bridge needs only half the cycles for the loads in kernel VI. 
VI is therefore faster than V2 on Sandy Bridge (see Table[2]). 
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So far we have assumed that all data resides in the LI cache. 
The data transfers required to bring cachelines into LI and 
back to memory are modeled separately. We assume that 
there is no overlap between data transfers and instruction 
execution. This is true at least for the LI cache: It can either 
communicate with L2 to load or evict a cacheline, or it can 
deliver data to the registers, but not both at the same time. 
As a first approximation we also (pessimistically) assume 
that this "no-overlapping" condition holds for all caches, and 



that a data transfer between any two adjacent levels in the 
memory hierarchy does not overlap with anything else. Since 
the smallest transfer unit is a 64-byte cacheline, the analysis 
will from now on be based on a full "cacheline update" (16 
four- byte voxels), which corresponds to four (two) inner loop 
iterations when using SSE (AVX). 

We only consider the data traffic for voxel updates; the im- 
age data traffic is negligible in comparison, hence we assume 
that all image data comes from LI cache. It takes two cycles 
to transfer one cacheline between adjacent cache levels over 
the 256-bit unidirectional data path. Every modified line 
must eventually be evicted, which takes another two cycles. 
Figure [5] shows a full analysis, in which the core execution 
time for a complete cacheline update is based on the mea- 
sured cycles from Table [2] On the three architectures with 
L3 cache the simplification is made that the "Uncore" part 
(L3 cache, memory interface, and QuickPath interconnect) 
runs at the same frequency as the core, which is not strictly 
true but does not change the results significantly. It was 
shown for the Nehalem-based architectures (Westmere and 
Westmere EX) that they can overlap instruction execution 
with reloading data from memory to the last level cache [24] . 
Hence, the model predicts that the in-core execution time 
is much larger than all other contributions, which makes 
this algorithm limited by instruction throughput for single 
core execution. On Sandy Bridge, the AVX kernel requires 
76.2 cycles for one vectorized loop iteration (eight updates). 
This results in 152 cycles instead of 178 cycles (SSE) for one 
cacheline update. 

Based on the runtime of the loop kernel we can now estimate 
the total required memory bandwidth for multithreaded exe- 
cution if all cores on a socket are utilized, and also derive the 
expected performance (we consider the full volume without 



clipping): 

HPT WEM WEX SNB ^VX) 

BW/core [GB/s] L7 L9 L~5 2^5 51) 
BW/socket [GB/s] 6.8 11.2 11.6 10.0 12.0 
Perf. [GUP/s] 0.85 1.42 1.45 1.25 1.51 

We conclude that the multithreaded code is bandwidth- 
limited only on Harpertown, since the required socket band- 
width is above the practical limit given by the update bench- 
mark (see Table Q]). All other architectures are below their 
data transfer capabilities for this operation and should show 
no benefit from further bandwidth-reducing optimizations 
(see Sect. EH)) . 

5.2 ILP optimization and SMT 

At this point the analysis still neglects the possible bene- 
fit from SMT. SMT can significantly improve the efficiency 
of the floating point units for codes that are limited by in- 
struction throughput and suffer from underutilization of the 
arithmetic units due to dependencies or instruction schedul- 
ing issues. This is definitely the case here, as indicated 
by the discrepancy between the "throughput" and "critical 
path" predictions in the previous section. Due to the com- 
plex loop body register dependencies are unavoidable, re- 
sulting in many pipeline bubbles. Outer loop unroll and jam 
(interleaving two outer loop iterations in the inner body) is 
out of the question due to register shortage, but SMT can 
do a similar job and provide independent instruction streams 
using independent register sets. Since memory bandwidth is 
no limitation on all SMT-enabled processors considered here, 
running two threads on the two virtual cores of each physical 
core is expected to reduce the cycles taken for the cacheline 
update. However, the effect of using SMT is difficult to esti- 
mate quantitatively. See Sect. below for complete parallel 
results 

6. OPENMP PARALLELIZATION 

OpenMP parallelization of the algorithm is straightforward 
and works with all optimizations discussed so far. For the 
thread counts and problem sizes under consideration here it 
is sufficient to parallelize the loop that iterates over all voxel 
volume slices (loop variable z in Listing!]]). However, due to 
the clipped-off voxels at the edges and corners of the volume, 
simple static loop scheduling with default chunksize leads to 
a strong load imbalance. This can be easily corrected by 
using block-cyclic scheduling with a small chunksize (e.g., 
static , 1). 

Images are produced one by one during the C-arm rotation, 
and could at best be delivered to the application in batches. 
Since the reconstruction should start as soon as images be- 
come available, a parallelization across images was not con- 
sidered. 

As shown in Sect. [5] the socket-level performance analysis 
does not predict strong benefits from bandwidth-reducing 
optimizations except on the Harpertown platform. However, 
since one can expect to see more bandwidth-starved proces- 
sor designs with a more unbalanced ratio of peak perfor- 
mance to memory bandwidth in the future, we still consider 



bandwidth optimizations important for this algorithm. Fur- 
thermore, ccNUMA architectures have become omnipresent 
even in the commodity market, making locality and band- 
width awareness mandatory. In the following sections we 
will describe a proper ccNUMA page placement strategy 
for voxel and image data, and a blocking optimization for 
bandwidth reduction. The reason why we present those op- 
timizations in the context of shared-memory parallelization 
is that they become relevant only in the parallel case, since 
bandwidth is not a problem on all architectures for serial 
execution (see Sect. 15.1)) , 

6.1 ccNUMA placement 

The reconstruction algorithm uses essentially two relevant 
data structures: the voxel array and the image data arrays. 
Upon voxel initialization one can easily employ first-touch 
initialization, using the same OpenMP loop schedule (i.e., 
access pattern) as in the main program loop. This way each 
thread has local access (i.e., within its own ccNUMA do- 
main) to its assigned voxel layers, and the full aggregate 
bandwidth of a ccNUMA node can be utilized. 

Although the access to the projection image data is much 
less bandwidth-intensive than the memory traffic incurred 
by the voxel updates, ccNUMA page placement was im- 
plemented here as well. As mentioned in Sect. 13.31 the 
padded projection buffers are explicitly allocated and initial- 
ized in each locality domain, and a local copy is shared by all 
threads within a domain. Since the additional overhead for 
the duplication is negligible, this ensures conflict-free local 
access to all image data. The time taken to copy the images 
to the local buffers is included in the runtime measurements. 

6.2 Blocking/unrolling 

In order to reduce the pressure on the memory interface 
we use a simple blocking scheme for the outer loop over all 
images: Projections are loaded and copied to the padded 
projection buffers in small chunks, i.e., b images at a time. 
The line update kernel (see Sect. |4| for a certain pair of 
(y, z) coordinates is then executed b times, once for each 
projection. This corresponds to a 6-way unrolling of the 
image loop and a subsequent jam into the next-to- innermost 
voxel loop (across the y voxel coordinate). At the problem 
sizes studied here, all the voxel data for this line can be 
kept in the LI cache and reused b — 1 times. Hence, the 
complete volume is only updated in memory 496/b instead of 
496 times. Relatively small unrolling factors between 2 and 
8 are thus sufficient to reduce the bandwidth requirements 
to uncritical levels even on "starved" processors like the Intel 
Harpertown. 

This optimization is so effective that it renders proper cc- 
NUMA placement all but obsolete; we will thus not report 
the benefit of ccNUMA placement in our performance re- 
sults, although it is certainly performed in the code. 

7. RESULTS 

In order to evaluate the benefit of our optimizations we have 
benchmarked different code versions with the 512 3 case on 
all test machines. RabbitCT includes a benchmarking ap- 
plication, which takes care of timing and error checking. 
It reports total runtime in seconds for the complete back- 
projection. We performed additional hardware performance 



counter measurements using the likwid-perfctr tool. Likwid- 
perfctr can produce high-resolution timelines of counter data 
and useful derived metrics on the core and node level with- 
out changes to the source code. Unless stated otherwise we 
always report results using two SMT threads per core. For 
all architectures apart from Sandy Bridge the line update 
kernel version V2 was used. On Sandy Bridge results for 
the SSE kernel VI as well as for the AVX port of the VI 
kernel are presented. 



7.1 Validation of analytical predictions 

To validate the predicted performance of the analytic model 
(see Sect. [5]), single-socket runs were performed without the 
clipping optimization and SMT. Blocking was used on the 
Harpertown platform only, to ensure that execution is not 
dominated by memory access. The following table shows the 
measured performance and the deviation against the model 
prediction: 



HPT 



WEM 



WEX 



SNB 



SNB 
(AVX) 



0.75 



Perf. 
[GUP/s] 
deviation -13.3% 



1.20 
-18.3% 



1.30 
-11.5% 



1.11 1.28 
-12.6% -18.0% 



This demonstrates that the model has a reasonable predic- 
tive power. It has been confirmed that the contribution of 
data transfers indeed vanishes against the core runtime, de- 
spite the fact that the total transfer volume is high and a 
first rough estimate based on data transfers and arithmetic 
throughput alone (Sect. [3]) predicted a bandwidth limitation 
of this algorithm on all machines. 

As a general rule, the IACA tool can provide a rough esti- 
mate of the innermost loop kernel runtime via static code 
analysis. Still it is necessary to further enhance the ma- 
chine model to improve the accuracy of the predictions. Es- 
pecially the ability of the out of order scheduler to exploit 
superscalar execution was overestimated and has led to qual- 
itatively wrong predictions. 

Note that this example is an extreme case with all data 
transfers vanishing against core runtime. However, the 
approach also works for bandwidth-limited codes, as was 
shown in 1231. 



7.2 Impact of optimizations on accuracy 

One of the costly operations in this algorithm is the di- 
vide involved. A possible optimization is to use the fully 
pipelined reciprocal (repps), which provides an approxima- 
tion with 12-bit accuracy compared to the 24-bit accuracy 
of the regular divide instruction. The accuracy of the recip- 
rocal can be improved to at least 21 bits using a Newton- 
Raphson iteration [26]. This requires four additional arith- 
metic instructions in the implementation. The following ta- 
ble shows the peak signal-to-noise ratio (PSNR) and perfor- 
mance of the three alternatives: Regular divide instruction, 
reciprocal, and reciprocal with Newton-Raphson iteration, 
on the Westmere test machine for the fully optimized case 
(full node). 



Performance 
[GUP/s] 



PSNR [dB] 



divps 3.79 
repps 4.21 
Newton-Raphson 3.79 



103 
67.7 
103 



As usual in image processing, the PSNR was determined as 
the logarithm of a mean quadratic deviation from a reference 
image: 



PSNR = 10 • lo gl 



M 



L~ 3 ]T [V(x,y,z)-R(x,y,z)] 



(1) 



Here V(x,y,z) is the reconstructed voxel grayscale value 
at coordinates (x,y,z) scaled to the interval [0, M], and 
R(x, y, z) is a reference voxel with the "correct" result. The 
higher the PSNR, the more accurate the reconstruction. 

There is no significant difference neither in performance 
nor accuracy between the regular divide and repps with 
Newton-Raphson. The performance improvement when 
using repps alone is 10%, at an accuracy which is still 
better than for the published GPGPU implementations 
(63-65 dB). In the following section we discuss performance 
results using this latter version. 

7.3 Parallel results 

Figures [6](a)-(d) display a summary of all performance re- 
sults on node and socket levels, and parallel scaling inside 
one socket for the best version on each architecture. All ma- 
chines show nearly ideal scaling inside one socket when using 
physical cores only. With SMT, the benefit is considerable 
on Sandy Bridge (33%) and Westmere (31%), and a little 
smaller on Westmere EX (25%). The large effect on Sandy 
Bridge may be attributed to a higher number of bubbles in 
the pipeline, as indicated by the larger discrepancy between 
the "throughput" and "critical path" cycles in the AVX loop 
kernel (see Sect. 15.20 . Scalability from one to all sockets 
of the node is also close to perfect for the multisocket ma- 
chines, with the exception of Westmere EX, on which there 
is a slight load imbalance due to 80 threads working on only 
512 slices. 

Depending on the architecture, SSE vectorization boosts 
performance by a factor of 2-3 on the socket level. As ex- 
plained earlier (see Sect. 2]), part 2 of the algorithm prohibits 
the optimal speedup of 4 because its runtime is linear in the 
SIMD vector length. Work reduction through clipping alone 
shows only limited effect due to load imbalance, but this can 
be remedied by an appropriate cyclic OpenMP scheduling, 
as described in Sect. [5] This kind of load balancing not only 
improves the work distribution but also leads to a more simi- 
lar access pattern to the projection images across all threads. 
This can be seen in Fig.[7|(a), which shows the cacheline traf- 
fic between the L2 and LI cache during a 6-thread run on 
one Westmere socket with all optimizations except clipping 
(only 3 cores are shown for clarity) . Although the amount of 
work, i.e., the number of voxels, is perfectly balanced across 
all threads, there is a strong discrepancy in cacheline traf- 
fic between threads when standard static scheduling is used. 
The reason for this is that the projections of voxel lines onto 
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Figure 6: Scalability and performance results for the 512 test case on all platforms. In-socket scalability was tested using 
the best version of the SIMD- vectorized line update kernel on each system (AVX-V1 on Sandy Bridge, SSE-V2 on all others). 
The practical performance goal for complete reconstruction (20 seconds runtime, corresponding to 3.33GUP/s) is indicated 
as a dashed line. Gflop/s numbers have been computed assuming 31 flops per optimized (scalar) inner loop iteration. Note 
the scale change between the left and right pairs of graphs. 
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(b) OpenMP cyclic (static, 1) scheduling 
Figure 7: Timeline view of the L2 to LI cacheline traffic for 
three cores of a 6-thread run without clipping on one West- 
mere socket, using (a) standard static OpenMP scheduling 
and (b) cyclic "static,!" scheduling. 



the detector are quite different for lines that are far apart 
from each other in the volume, which leads to vastly different 
access patterns to the image data, and hence very dissimilar 
locality properties. Cyclic scheduling removes this variation 
(see Fig. [7](b)). 

Cache blocking has little to no effect on all architectures ex- 
cept Harpertown, as predicted by our analysis. Fig.[S]shows 
timelines for socket floating point performance and band- 
width on one Westmere socket, comparing blocked/non- 
blocked and SMT/non-SMT variants. Fig. [S](a) clearly 
demonstrates the performance boost of SMT in contrast 
to the very limited effect of blocking. On the other hand, 
the blocked implementation significantly reduces the band- 
width requirements (Fig. 0(b)). The blocked variants have 
a noticeable amplitude of variations while the nonblocked 
versions appear smooth. In the inset of Fig. [8](b) we show a 
zoomed-in view with finer time resolution, which indicates 
that the frequency of bandwidth variations is much larger 
without blocking; this is plausible since the bandwidth is 
dominated by the voxel volume in this case. With blocking, 
individual voxel lines are transferred in "bursts" with phases 
of inactivity in between, where image data is read at low 
bandwidth. 

The benefit of AVX on Sandy Bridge falls short of expec- 
tations for the same reason as in the SSE case. Still it is 
remarkable that the desktop Sandy Bridge system outper- 
forms the dual-socket Harpertown server node, which fea- 
tures twice the number of cores at a similar clock speed. 
Both Westmere and Westmere EX meet the performance 
requirements of at most 20 sec for a complete volume recon- 
struction. The Westmere EX node is, however, not compet- 
itive due to its unfavorable price to performance ratio. It is 
an option if absolute performance is the only criterion. 



8. CPU VS. GPGPU 
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Figure 8: Performance counter timeline monitoring of 
floating-point performance (a) and memory bandwidth (b), 
comparing blocked/nonb locked and SMT/non-SMT vari- 
ants of the best implementation on one Westmere socket 
at 100 ms resolution. The inset in (b) shows a zoomed-in 
view with 2 ms resolution. 
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Figure 9: Performance comparison between the best re- 
ported GPGPU implementations in OpenCL and CUDA 
and our CPU versions on the systems in the test bed for 
problem sizes 512 3 and 1024 3 , respectively. There is cur- 
rently no working CUDA implementation for the latter case. 
The practical performance goal for complete reconstruction 
of a 512 3 volume (3.33 GUP/s) is indicated by a dashed line. 



Since the backprojection algorithm is well suited for GP- 
GPUs, the performance leaders of the open competition 
benchmark have traditionally been GPGPU implementa- 
tions in OpenCL and CUDA. The gap to the fastest CPU 
version reported on the RabbitCT Web site [9] before this 
work was started was very large. For the reasons given in the 
derivation of the performance model, simple bandwidth or 
peak performance comparisons are inadequate to estimate 
the expected reconstruction speed advantage of GPGPUs, 
although it should certainly lie inside the usual corridor of 
4-10 when comparing with a full CPU socket. An example 
of a well-optimized GPU code was published recently [27j . 
Our implementation shows that current x86 multicore chips 
are truly competitive (see Fig. [9}, even when considering the 
price/performance ratio. Beyond the clinically relevant 512 3 
case, industrial applications need higher resolutions. This is 
a problem for GPGPU implementations because the local 
memory on the card is often too small to hold the complete 
voxel volume, causing extra overhead for moving partial vol- 
umes in and out of the local memory and leading to a more 
complex implementation. If the price for the hardware is 
unimportant, a Westmere EX node is an option that can 
easily outperform GPGPUs. Note that the unusually large 
gap between the performance levels at 512 3 and 1024 3 on 
this architecture may be attributed to better load balancing 
when 80 threads work on 1024 instead of 512 slices. It also 
lifts the performance efficiency (fraction of node peak) on 
this system to about 50%, matching the level achieved by 
the other multicore nodes already on the smaller problem. 
This compares to 26% for the CUDA code on the GTX480 
card, assuming the same number of flops per voxel update. 

The results for the Sandy Bridge desktop system and the 
good scalability even of the optimized algorithm promise 
even better performance levels on commodity multicore sys- 
tems in the future. Note that it would be possible to pro- 
vide a simple and efficient distributed memory paralleliza- 
tion of the algorithm for even faster reconstruction. "Micro- 



clusters" based on cheap desktop technology could then eas- 
ily meet any time constraint at extremely low cost. However, 
a comparison based on the hardware cost alone is certainly 
too simplistic, especially when taking into account the over- 
all cost for a complete CT scanner including maintenance. 

9. CONCLUSIONS AND OUTLOOK 

We have demonstrated several algorithmic and low-level op- 
timizations for a CT backprojection algorithm on current 
Intel x86 multicore processors. Highly optimizing compil- 
ers were not able to deliver useful SIMD-vectorized code. 
Our implementation is thus based on assembly language 
and vectorized using the standard instruction set extensions 
SSE and AVX. The results show that commodity hardware 
can be competitive with highly tuned GPU implementations 
for the clinically relevant 512 3 voxel case at the same level 
of accuracy. Nonpipelined divide instructions (divps) or 
a fast pipelined version (repps) with subsequent Newton- 
Raphson iteration provide better accuracy at a 10% per- 
formance penalty. Our results showed that it is necessary 
to consider all aspects of processor and system architecture 
in order to reach best performance, and that the effects of 
different optimizations are closely connected to each other. 
The benefit of the AVX instruction set on Sandy Bridge was 
limited due to the lack of a gathered load and the small num- 
ber of instructions that natively operate on the full SIMD 
register width. This relevant algorithm can achieve very 
good efficiencies on commodity processors and it would be 
a natural step to further improve performance with a dis- 
tributed memory implementation. At higher resolutions, 
which are used in industrial applications, multicore systems 
are frequently the only choice (apart from expensive custom 
solutions) . 

Future work includes a more thorough analysis and opti- 
mization of the AVX line update kernel, and an inclusion 
of the new AVX2 gather operations once they become avail- 



able. 
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